Abstract: The dynamics of information dissemination in social
networks is of paramount importance in processes such as the
propagation of rumors or fads, the spread of product innovations [2] or
"word-of-mouth" communications [3, 4]. Due to
the difficulty in tracking a specific information 
when it is transmitted by people, most under- 
standing of information spreading in social net- 
works comes from models [5] or indirect measure- 
ments [6]. Here we present an integrated experi- 
mental and theoretical framework to understand 
and quantitatively predict how and when infor- 
mation spreads over social networks. Using data 
collected in Viral Marketing campaigns [7] that 
reached over 31,000 individuals in eleven Euro- 
pean markets, we show the large degree of vari- 
ability of the participants' actions, despite them 
being confronted with the common task of receiv- 
ing and forwarding the same piece of informa- 
tion. Specifically we observe large heterogeneity 
in both the number of recommendations made 
by individuals and of the time they take to trans- 
mit the information. Both have a profound ef- 
fect on information diffusion: Firstly, most of the 
transmission takes place due to super-spreading 
events which would be considered extraordinary 
in population-average models. Secondly, due to
the different way individuals schedule information
transmission [10], we observe a slowing down
of the spreading of information in social networks
that happens in logarithmic time. A quantitative
description of the experiments is possible through
a stochastic branching process [11] which corroborates
the importance of heterogeneity. The
fact that both the intensity and frequency of human
responses also show large degrees of heterogeneity
in many other activities [12, 13, 14]
suggests that our findings are pertinent to many
other human-driven diffusion processes like rumors,
fads, innovations or news, which has important
consequences for the management of organizations,
communications, marketing or electronic
social communities.

Each day, millions of conversations, e-mails, SMS, blog 
comments, instant messages or web pages containing var- 
ious types of information are exchanged between people. 
Humans behave in a viral fashion, having a natural inclination
to share information so as to gain reputation,
trustworthiness or money. This "word-of-mouth"
(WOM) dissemination of information through social networks
is of paramount importance in our everyday life.
For example, WOM is known to influence purchasing 
decisions to the extent that 2/3 of the economy of the 
United States is driven by WOM recommendations [1]. 
WOM is also important for understanding communication
inside organizations, opinion formation in societies
or rumor spreading. Despite its importance, detailed em- 
pirical data about how humans disseminate information 
are scarce or indirect [6, 15]. Most understanding comes
from implementing models and ideas borrowed from epi- 
demiology on empirical or synthetic social networks [3,[a]. 
However, unlike virus spreading, information diffusion 
depends on the voluntary nature of humans, has a per- 
ceived transmission cost and is only passed by its host 
to individuals who may be interested in it [17]. Here
we present a large scale experiment designed to measure 
and understand the influence of human behavior on the 
diffusion of information. 



We analyzed a series of controlled viral marketing [7]
campaigns in which subscribers to an on-line newsletter 
were offered incentives for promoting new subscriptions 
among friends and colleagues. This offering was virally 
spread through recommendation e-mails sent by partici- 
pants. This "recommend-a-friend" mechanism was fully 
conducted electronically and thus could be monitored at 
every step. Spurred by exogenous online advertising, a 
total of 7,153 individuals started recommendation cas- 
cades subsequently fueled through viral propagation car- 
ried out by 2,112 secondary spreaders. This resulted in 
another 21,918 individuals touched by the message which 
they did not pass along further. All in all, 31,183 indi- 
viduals were "infected" by the viral message. Of those, 
9,265 were spreaders. Thus, 77% of the participants were 
reached by the endogenous WOM viral mechanism. We 
call seed nodes the individuals spontaneously initiating 
recommendation cascades and viral nodes the individuals 
who pass e-mail invitations along after having received 
them from other participants. The topology of the re- 
sulting viral recommendations graph (designated as the 
Viral Network) is a directed network formed by 7,188 
isolated components, or viral cascades, where nodes rep- 
resenting participants are connected by arcs representing 
recommendation e-mails (see Fig. 1).
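
The participant counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below only re-derives figures already given in the text (the number of spreaders, the total reach, the 77% viral share and the 8.79% transmissibility); the variable names are ours and purely illustrative.

```python
# Counts reported in the text for all eleven markets combined.
seed_nodes = 7_153       # started a cascade after seeing the online advertising
viral_nodes = 2_112      # forwarded the message after receiving it by e-mail
passive_nodes = 21_918   # received the message but did not pass it along

spreaders = seed_nodes + viral_nodes
total_reached = spreaders + passive_nodes

# Everyone except the seeds was reached through a recommendation e-mail.
reached_virally = total_reached - seed_nodes
viral_share = reached_virally / total_reached

# Viral Transmissibility: share of e-mail recipients who became spreaders.
transmissibility = viral_nodes / reached_virally

print(spreaders, total_reached)
print(f"{viral_share:.1%} reached virally, transmissibility = {transmissibility:.4f}")
```

Under this reading, the transmissibility is the ratio of secondary spreaders to all participants who received a recommendation e-mail, which is how the 8.79% figure is recovered.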

FIG. 1: The viral network detected in the campaigns consists 
of a large number of disconnected clusters as this one found in 
Spain. It has 122 nodes and its diameter (longest undirected 
path) is 13. The structure starts out of a seed participant in 
the center (black) and grows through secondary viral prop- 
agation of viral nodes (gray) until it reaches this large size. 
The probability of finding a similar occurrence in homogeneous
random network models (see Figure 3) is negligible.



Group       Nodes    Cascades   $\bar r_s$   $\bar r_v$   $\lambda$   $s$     $s^*$
ALL         31,183   7,188      2.51         2.96         0.088       4.39    4.34
SP+IT       6,862    1,162      3.14         3.38         0.11        5.99    5.91
France      11,754   3,244      2.20         2.50         0.070       3.67    3.62
AT+DE       7,938    1,743      2.55         3.07         0.095       4.59    4.55
UK+Nordic   4,629    1,039      2.69         2.79         0.084       4.51    4.45

TABLE I: The eleven participating countries have been dis- 
tributed in four culturally homogeneous groups for statistical 
relevance. Network parameters of their corresponding viral 
network, shown above, include the theoretical average cas- 
cade size s predicted by the model through equation (1), and 
the real value s* measured in the campaigns. 



The spreading of information or diseases in a population is often
described by average quantities. Although infection and propagation
can be quite involved processes, population-level analyses describe
viral propagation as a function of the probability that a virally
informed person becomes a secondary spreader ($\lambda$), and of the
average number of people contacted by secondary spreaders ($\bar r$).
Thus, in this simple approach, two parameters fully characterize the
mean-field description of information diffusion: the Viral
Transmissibility ($\lambda$) and the Fanout coefficient ($\bar r$). In
the viral campaigns we found that only 8.79% of the participants
receiving a recommendation e-mail engaged in spreading, and thus
$\lambda = 0.0879$. The Fanout coefficient $\bar r$ is the average
number of recommendation e-mails sent by spreading nodes. Its value is
noticeably higher for viral nodes ($\bar r_v = 2.96$) than for seed
nodes ($\bar r_s = 2.51$), showing a stronger involvement in viral
behavior when the invitation to pass messages along is received from a
trusted source. As a result, the average number of secondary cases
generated by each informed individual is given by the basic
reproductive number $R_0 = \lambda \bar r_v$. Both $\lambda$ and
$\bar r_v$ also depend on the specific country in which the campaign
was run (see figure 2), but in all cases we found $R_0 < 1$, i.e. the
viral campaigns did not reach the "tipping-point". Since the campaign
execution was identical in all countries, we conclude that the
differences observed in the propagation parameters are due to the
varying appeal of the viral offering to customers in different
markets. However, the data suggest a strong linear correlation between
the Transmissibility $\lambda$ and the Fanout coefficient. This
peculiarity of information diffusion processes, not observed in
traditional epidemics, stems from the fact that the decision to become
a spreader and the decision on how many viral messages to send are
taken by the same individual and thus are, on average, correlated. As
a result, the basic reproductive number $R_0$ scales at least
quadratically with the probability of a touched individual becoming a
spreader, i.e. being convinced to propagate the message. Thus,
increasing the perceived value of the viral campaign offer would have
a quadratic effect instead of a linear one and the tipping-point would
be reached for lower than expected $\lambda$ values.

FIG. 2: Upper panel: Fanout cumulative probability distribution
function for viral campaigns in all countries (circles). Solid lines
show maximum likelihood fits for the power-law
$P(r_v > x) = H/(\beta + x^{\alpha})$ (black), with $H$ a normalization
constant, $\beta = 60.07$ and $\alpha = 3.50$, and for the Poisson
probability distribution function with mean $\bar r_v$ (see appendix A).
Lower panel: Fanout Coefficient for viral (circles) and seed (squares)
participants as a function of the Viral Transmissibility $\lambda$ for
different groups of countries. For a given campaign, both parameters
are linearly dependent as $\bar r_v = a_v \lambda + b_v$ because the
participants' viral decisions stem from evaluating the same utility
function. For the campaigns analyzed the linear fit results in
$a_v = 21.9$ and $b_v = 0.971$. Variation between countries is due to a
different acceptance of the offering by customers in those markets.

However, average quantities like $R_0$ can hide the heterogeneous
nature of information diffusion. In fact, we find in our experiments
that most of the transmission we observe takes place due to
extraordinary events. In particular, we find that the number of
recommendations sent by spreaders is distributed as a power-law
$P(r > x) \sim x^{-\alpha}$, as seen in figure 2, indicating the high
probability of finding large numbers of recommendations in the viral
cascades. This large demographic stochasticity has been observed in a
number of other human activities like the number of e-mails sent by
individuals per day, the number of telephone calls placed by users,
the number of weblog posts by a single user [10], the number of web
page clicks per user [12], and the number of a person's social
relationships [13] or sexual contacts [14]. All these examples suggest
that the response of humans to a particular task cannot be described
by close-to-average models in which they all behave in a similar
fashion, with at most some small degree of demographic stochasticity.
For example, we find that 2% of the population has $r > 10$,
suggesting the existence of super-spreading individuals in sharp
contrast with homogeneous models of information spreading.
Super-spreading individuals have also been found in non-sexual disease
spreading [20], where they have a profound effect. As in that case, we
find that super-spreading individuals are responsible for making large
viral cascades rarer but more explosive (see figure 3). For example,
if we neglect the existence of super-spreading individuals but still
consider some degree of stochasticity in the number of recommendations
by making $r$ a Poisson process with average $\bar r$, a viral cascade
like the one in figure 1 would appear approximately once every
$10^{12}$ seeds, a number much larger than the total world population
(see figure 3).
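
To make the contrast between the two descriptions concrete, the following sketch compares the probability of sending more than ten recommendations under a heavy-tailed and a Poisson model. It is an illustration under stated assumptions, not the fitting procedure used for the campaigns: we assume a probability mass function of the form $P(r) \propto 1/(\beta + r^{\alpha})$ on $r = 1, 2, \ldots$ with the viral-node parameters $\alpha = 3.50$ and $\beta = 60.07$, truncate the support at a large `r_max`, and use a Poisson variable with mean 2.96 as the homogeneous alternative. All function names are hypothetical.

```python
import math

def power_law_pmf(alpha, beta, r_max=100_000):
    """Normalized P(r) = H / (beta + r**alpha) on r = 1..r_max (assumed form, truncated)."""
    weights = [1.0 / (beta + r**alpha) for r in range(1, r_max + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]

def poisson_tail(mean, x):
    """P(r > x) for a Poisson variable with the given mean."""
    cdf = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(x + 1))
    return 1.0 - cdf

ALPHA_V, BETA_V = 3.50, 60.07        # viral-node parameters quoted in the paper
pmf = power_law_pmf(ALPHA_V, BETA_V)

# pmf[0] corresponds to r = 1, so pmf[10:] covers r = 11, 12, ...
mean_r = sum(r * p for r, p in enumerate(pmf, start=1))
tail_power_law = sum(pmf[10:])
tail_poisson = poisson_tail(2.96, 10)

print(f"mean fanout (power law): {mean_r:.2f}")
print(f"P(r > 10): power law {tail_power_law:.4f}, Poisson {tail_poisson:.6f}")
```

With these assumptions the heavy-tailed model puts roughly 2% of spreaders above ten recommendations, while the Poisson model with the same mean makes such participants far rarer.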

An important question is whether the observed de- 
mographic stochasticity in the number of recommenda- 
tions is directly related to the heterogeneity of social con- 
tacts [21] . Recent available data about social networks 
has revealed that humans show also large variability in 
their number of social contacts. In particular, it has 
been found that social connectivity is distributed as a 
power-law, much like the number of recommendations in 
our viral campaigns [22]. Moreover, large variability in
FIG. 3: a) Cumulative distribution function of the viral cascade
sizes in all countries (circles). The solid black line represents the
prediction of the branching model (see text) while the red solid line
is the Poisson prediction. b) Average size of the viral cascades as a
function of the Viral Transmissibility $\lambda$ for different groups
of countries (circles). The solid line is the prediction of the
branching model (Eq. 1), which diverges at the tipping point
$\lambda_c \simeq 0.1926$, estimated using the linear fits of figure 2
for $\bar r_v$ and $\bar r_s$. The red line and symbols show
$\bar r_v$ as a function of $\lambda$. Note that at the tipping point
the average number of viral e-mails sent is just $\bar r_v = 5.18$.



the numbers of social contacts has a profound effect on
information or disease spreading [23, 24]. Specifically,
simulations of information or disease spreading models 
on networks show that if information or disease flows 
through every social contact, the topological properties 
of social networks can significantly lower the "tipping- 
point". While this might be the case of computer virus 
spreading or any other kind of automatic propagation 
through social networks, information transmission is vol- 
untary and participants who engage in the spreading con- 
sider the cost and benefits of doing so. Thus, the number 
of recommendations sent by each participant (including 
not sending any) results from a trade-off between the 
information forwarding cost and the perceived value of 
doing it. When the value is low, the average number of 
recommendations can be very low, a small fraction of the 
sender's social contacts which makes the social network 
topology largely irrelevant in the decision making problem. In fact,
our data suggest that this is the case; specifically, most of the
viral cascades have a tree-like structure while social networks are
characterized by a large density of local loops [25]. To illustrate
this observation quantitatively, we have measured the clustering
coefficient $C$, i.e., the fraction of an individual's contacts who
are in contact with each other. Email social networks have large
values of clustering ($C_{email} \simeq 0.15 - 0.25$) [21], while in
our case we find $C_{viral} = 4.81 \times 10^{-3}$. Of course, these
numbers are not independent: as shown in appendix C, and under fairly
general assumptions, we should expect that
$C_{viral} = C_{email} \times 2R_0/(\langle k_{nn} \rangle - 1)$, where
$\langle k_{nn} \rangle$ is the average number of social contacts of
the neighbors of an individual. In social networks
$\langle k_{nn} \rangle$ is a large number, and thus viral cascades
have a very small clustering coefficient even when close to the
tipping-point $R_0 = 1$. Thus, we have found that the reach of
information diffusion can be very large without sampling the
topological properties of the social network of individuals. This
implies that the large heterogeneity observed in the number of
recommendations is a characteristic of human decision making tasks
rather than a reflection of the social network.

Given the above results, we have modeled the viral 
campaigns recommendation cascades through a branch- 
ing process in which the recommendation heterogeneity is 
considered but the social network topology is neglected. 
Each cascade starts from an initial seed that initiates viral
propagation with a random number of recommendations $r_s$ distributed
by $P(r_s)$ and whose average is $\bar r_s$. Touched individuals
become secondary spreaders with probability $\lambda$, thereby giving
birth to a new generation of viral nodes which, in turn, propagate the
message further with $r_v$ recommendations distributed by $P(r_v)$
with average $\bar r_v$ [33]. The propagation continues through
successive generations until none of the last touched individuals
decides to become a secondary spreader. This process corresponds
to the well known Bellman-Harris branching model [11].
On average, the infinite time limit cascade size can be
estimated as

$$ s = 1 + \frac{\bar r_s}{1 - \lambda \bar r_v}, \tag{1} $$

which is within a striking 1% error of the experimental
values found in the viral campaigns (see Table I). Not
only are average cascade sizes well predicted, but their
distribution is properly replicated when the heterogeneity in the
number of recommendations is implemented (see figure 3). Both results
show how accurate the model can be in predicting the extent of a viral
marketing campaign: since the values of $\lambda$, $\bar r_v$ and
$\bar r_s$ can be roughly estimated during the early stages of the
campaign, we could have predicted the final reach of a viral campaign
at its very beginning. Moreover, given the knowledge of how $\lambda$
and $\bar r_v$ are connected and using equation (1), we can estimate
the critical viral transmissibility $\lambda_c$ at which the viral
message percolates through a fraction of the entire network [34]. We
found that $\lambda_c = 0.1926$, which corresponds to
$\bar r_v = 5.18$. Of course this is an upper limit to the real
"tipping-point" since it is based on the assumption that each seed
originates one isolated viral cascade, which is only valid far from
the "tipping-point". The low number of recommendations needed to reach
the "tipping-point" illustrates the limited effect of the social
network topology on the efficiency of viral campaigns. Thus, it is not
necessary to send the message to every social contact of each
participant in order to reach a significant fraction of the target
population.
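
The two predictions discussed above can be reproduced numerically. The sketch below evaluates the mean cascade size $1 + \bar r_s/(1 - \lambda \bar r_v)$ for the parameter values listed in Table I and finds the transmissibility at which $R_0 = \lambda \bar r_v$ reaches 1 when $\bar r_v = a_v \lambda + b_v$ with the fitted values $a_v = 21.9$ and $b_v = 0.971$. The function names and the dictionary layout are ours; the numbers are copied from the table and the Fig. 2 caption.

```python
import math

# (r_s, r_v, lambda, measured mean cascade size s*) per group, from Table I.
TABLE_I = {
    "ALL":       (2.51, 2.96, 0.088, 4.34),
    "SP+IT":     (3.14, 3.38, 0.11,  5.91),
    "France":    (2.20, 2.50, 0.070, 3.62),
    "AT+DE":     (2.55, 3.07, 0.095, 4.55),
    "UK+Nordic": (2.69, 2.79, 0.084, 4.45),
}

def mean_cascade_size(r_s, r_v, lam):
    """Equation (1); only meaningful below the tipping point (lam * r_v < 1)."""
    r0 = lam * r_v
    if r0 >= 1:
        raise ValueError("R0 >= 1: the mean cascade size diverges")
    return 1 + r_s / (1 - r0)

def critical_transmissibility(a_v, b_v):
    """Positive root of lam * (a_v * lam + b_v) = 1, i.e. the point where R0 = 1."""
    return (-b_v + math.sqrt(b_v**2 + 4 * a_v)) / (2 * a_v)

predicted = {
    group: mean_cascade_size(r_s, r_v, lam)
    for group, (r_s, r_v, lam, _measured) in TABLE_I.items()
}
lam_c = critical_transmissibility(21.9, 0.971)
r_v_c = 21.9 * lam_c + 0.971   # fanout implied by the linear fit at the tipping point

for group, (_, _, _, measured) in TABLE_I.items():
    print(f"{group:10s} predicted {predicted[group]:.2f}  measured {measured:.2f}")
print(f"lambda_c = {lam_c:.4f}, r_v at tipping point = {r_v_c:.2f}")
```

Because $R_0$ is quadratic in $\lambda$ under the linear fit, the critical value follows from a single quadratic equation rather than from a scan over $\lambda$.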

Information diffusion dynamics is also affected by the different way
individuals schedule the execution of their tasks. The time it takes
for participants to pass the message along after receiving it, or
"waiting-time" $\tau$, also shows a large degree of variability:
participants forward the message after $\bar\tau = 1.5$ days on
average, but with a very large standard deviation of
$\sigma_\tau = 5.5$ days, with some participants responding as late as
$\tau = 69$ days after receiving the invitation email (see figure 4).
The large variability of the distribution $G(\tau)$ of waiting times
observed in our data is consistent with recent measurements of how
humans organize their time when working on specific tasks, such as
email answering, market trading or web page visits [1, 26].
Traditional Poissonian models for $G(\tau)$ cannot match the observed
data and several long-tailed models like power-law [26] or log-normal
[27] distributions for $G(\tau)$ have been proposed to incorporate the
large waiting-times observed between actions. Our data is fully
consistent with a log-normal distribution and, moreover, the data
shows no statistical correlation with the number of recommendations
made by the participant (see figure 4). This means that the delay in
passing along a message and the number of recommendations made by
individuals are largely independent decisions. Within this
approximation, our simulations of the Bellman-Harris process with
waiting times distributed by a log-normal $G(\tau)$ and number of
recommendations by the power-law $P(r)$ show a remarkable agreement
with our data from the campaigns (see figure 4). On the other hand,
population-average models predict that the average number of infected
individuals $i(t)$ passing along the message at time $t$ is described
by the growth equation

$$ \frac{di}{dt} = \alpha_0\, i \tag{2} $$

where $\alpha_0 = (R_0 - 1)/\bar\tau$ is the Malthusian rate parameter
of the population. The number of people aware of the information until
time $t$ is the cumulative sum of infected individuals,
$s(t) = \int_0^t i(s)\,ds$. Equation (2) is the starting point of many
different deterministic models to describe the evolution of epidemics,
information or innovations in a population. It also describes the
asymptotic dynamics of models with some mild degree of heterogeneity
in $\tau$ [35]. The situation changes drastically when $G(\tau)$ has a
large degree of variability. Specifically, if $G(\tau)$ belongs to the
so-called class of subexponential
distributions, i.e. distributions that decay slower than
FIG. 4: a) Cumulative probability distribution of the time elapsed
$\tau$ between the reception and forwarding of the viral information
(circles) for participants in all countries. The solid line shows the
MLE fit to a log-normal distribution with $\mu = 5.547$ and
$\sigma^2 = 4.519$. Only viral nodes are considered, since the
reception time for seed nodes is undefined. Inset shows absence of
statistical correlation between the number of recommendations made
$r_i$ and the time elapsed $\tau_i$ until each participant forwards
the message. b) Average number of touched participants as a function
of the cascades' start time in our campaigns (circles) compared with
the prediction of the Bellman-Harris model (solid line), with the
fitted log-normal distribution (black), and with an exponential
distribution of the same mean (red). The dashed line is the analytical
approximation to a Bellman-Harris process with log-normal waiting
times given by $i(t) = \frac{1}{1 - \lambda \bar r_v}[1 - G(t)]$, where
$G(t)$ is the cumulative distribution function of the log-normal
distribution in a). Inset: Remarkable agreement between the average
size of the viral cascades as a function of total campaign time in log
scale (circles) and the Bellman-Harris model prediction with $G(t)$
log-normal. Also shown, in red, is the prediction with $G(t)$
exponential.



exponentially when $\tau \to \infty$, equation (2) is not valid.
This class contains important instances such as the power-law (or
Pareto) distribution, the Weibull or, as in our case, the
log-normal distribution. In the latter case we obtain that for
$R_0 < 1$, $i(t)$ is given in the long run by

$$ i(t) \sim \left[ 1 - \int_0^t G(\tau)\, d\tau \right] \sim \frac{e^{-a \ln^2 t}}{\ln t} \tag{3} $$
FIG. 5: Prevalence time $t_f$ as a function of the number of initially
infected people (i.e. number of seeds $N_s$) for the Bellman-Harris
branching process with values of $R_0 = \lambda \bar r_v$ and
$\bar r_s$ obtained in our campaigns for all countries (see Table I).
Prevalence time is calculated by solving equation $i(t_f) = 1/N_s$.
Solid lines correspond to different distributions $G(\tau)$:
log-normal (black) and Poisson (red).

with $a > 0$ a constant independent of $R_0$ (see appendix B).
Equation (3) demonstrates the deep impact of a large degree of
heterogeneity in our population: the very functional form of the time
dependence is changed and the dynamics of the system depends on a
logarithmic time scale, thus slowing down the propagation of
information in a drastic way. The situation is the opposite for
moderate values of $R_0 > 1$, where $i(t) \sim e^{\alpha t}$ with
$\alpha$ given by the solution of
$R_0 \int_0^\infty e^{-\alpha t} G(t)\, dt = 1$ but with
$\alpha > \alpha_0$, and thus information spreads much faster than
expected. The different behavior above and below the "tipping-point"
is due to the different importance that individuals with small or
large values of $\tau$ have in the dynamics: while below $R_0 = 1$ the
number of infected individuals decays in time up to the point where a
sole individual can halt the dynamics of a viral cascade, above
$R_0 = 1$ the dynamics is governed by individuals with small values of
$\tau$, which are more abundant than those with $\tau \simeq \bar\tau$
and thus speed up the diffusion. Since subexponential distributions
are found in other human tasks [1, 26, 27], our findings have the
important consequence that the high variability in the response of
humans to a particular task can slow down or speed up the dynamics of
processes taking place on social networks when compared to the
traditional population-average models.
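
The speed-up above the tipping point can be illustrated numerically. The sketch below solves $R_0 \int_0^\infty e^{-\alpha t} G(t)\,dt = 1$ by bisection, reading $G(t)$ in that integral as the waiting-time density, and compares the result with $\alpha_0 = (R_0 - 1)/\bar\tau$. Several ingredients are assumptions made for illustration only: the value $R_0 = 1.5$ is hypothetical (all campaigns had $R_0 < 1$), the log-normal parameters are the fitted $\mu = 5.547$ and $\sigma^2 = 4.519$ in whatever time unit the fit used, and the expectation is approximated on a grid of equally probable quantiles.

```python
import math
from statistics import NormalDist

MU, SIGMA2 = 5.547, 4.519          # log-normal fit quoted in the Fig. 4 caption
SIGMA = math.sqrt(SIGMA2)
R0_HYPOTHETICAL = 1.5              # illustrative value above the tipping point

# Midpoint quantiles of the log-normal: each grid point carries weight 1/N_GRID.
N_GRID = 20_000
_std_normal = NormalDist()
TAUS = [
    math.exp(MU + SIGMA * _std_normal.inv_cdf((k + 0.5) / N_GRID))
    for k in range(N_GRID)
]

def laplace_transform(alpha):
    """Approximate E[exp(-alpha * tau)] for the log-normal waiting time."""
    return sum(math.exp(-alpha * tau) for tau in TAUS) / N_GRID

def malthusian_rate(r0, lo=0.0, hi=1.0, iterations=60):
    """Bisection on r0 * E[exp(-alpha * tau)] = 1, which decreases with alpha."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if r0 * laplace_transform(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

mean_tau = math.exp(MU + SIGMA2 / 2)               # exact log-normal mean
alpha_0 = (R0_HYPOTHETICAL - 1) / mean_tau         # population-average estimate
alpha = malthusian_rate(R0_HYPOTHETICAL)

print(f"alpha_0 = {alpha_0:.2e}, alpha = {alpha:.2e}, ratio = {alpha / alpha_0:.1f}")
```

In this setting the growth rate obtained from the full waiting-time distribution exceeds the population-average estimate, in line with the statement that fast responders dominate the dynamics when $R_0 > 1$.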

Our study does not explain why the frequency and number of
recommendations made by people in our experiments are so heterogeneous
even though the decision they faced was the same. Rational
expectations suggest that individuals should have made their decisions
based on similar utility functions, and then the answers would have
been closer to each other. The fact that the same degree of
heterogeneity has been found for so many different tasks in humans
[1, 26, 13] suggests that it is an intrinsic
feature of human nature to be so wildly heterogeneous.
As we have shown, the main consequence of the large 
variability of human behavior is that population-level av- 
erage quantities do not explain the dynamics of social 
network processes. Important consequences of this large 
variability of behavior are the slowing down or speed up 
of information diffusion and that most of the diffusion 
takes place due to otherwise considered extraordinary 
events. The corrections to population-averaged predic- 
tions go beyond a different set of values for the dynamics 
parameters: They can even change the time scale or func- 
tional form of the predictions. In particular, we have seen 
that we are forced to revisit the way we model spreading 
processes mediated by humans by using differential equa- 
tions like ([2]). On the other hand, the slowing down of 
information diffusion implies that viral cascades or out- 
breaks do last much longer than expected, which could 
explain the prevalence of some information, rumors or
computer viruses. For example, if we assume that initially $N_s$ seeds
are infected, we could take as the end of information diffusion the
point when the fraction of infected individuals decays to
$i(t_f) \simeq 1/N_s$. While Poissonian approximations yield
$t_f \simeq \bar\tau/(1 - R_0) \ln N_s$, in our case we find that
$t_f \sim e^{\sqrt{b \ln N_s}}$, where $b > 0$ is independent of
$R_0$. When $N_s$ is large enough there is a huge difference between
both estimations. For example, if $N_s = 10^4$ (a large but moderate
value), then $t_f = 17$ days (with $R_0 = \lambda \bar r_v$) for
Poissonian models while $t_f \simeq 1$ year if $G(\tau)$ is described
by a log-normal distribution. As suggested in [28], the high
variability of response times can be the origin of the prevalence of
computer viruses. In fact, our viral cascades span longer in time than
initially expected, which may render viral campaigns impractical for
information diffusion. Companies, organizations or individuals
implementing such marketing tactics
to disseminate information over social networks face
the following dichotomy: If the tactic is successful and 
information spread reaches the "tipping-point" it does so 
very quickly; however, if it fails in reaching the "tipping- 
point" , the situation is even worse because information 
travels slowly in logarithmic time. We hope that our 
experiments and the fact that they can be accurately ex- 
plained by simple models will trigger more research to 
understand quantitatively human behavior. 
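
The two scalings of the prevalence time $t_f$ quoted in the discussion can be sketched in a few lines. This is a rough outline under our own simplifying assumptions, not the derivation of appendix B: we normalize $i(0) = 1$, keep only the leading factor $e^{-a \ln^2 t}$ of the long-time log-normal behavior, and identify $b = 1/a$, which the text does not state explicitly.

```latex
% Poissonian case: equation (2) with R_0 < 1, so \alpha_0 < 0
i(t) = e^{\alpha_0 t}, \qquad \alpha_0 = \frac{R_0 - 1}{\bar\tau}
\;\Longrightarrow\;
e^{\alpha_0 t_f} = \frac{1}{N_s}
\;\Longrightarrow\;
t_f = \frac{\bar\tau}{1 - R_0}\,\ln N_s .

% Log-normal case: leading-order long-time behavior only
i(t) \sim e^{-a \ln^2 t}
\;\Longrightarrow\;
a \ln^2 t_f \simeq \ln N_s
\;\Longrightarrow\;
t_f \sim e^{\sqrt{b \ln N_s}}, \qquad b = 1/a .
```

The first estimate grows only logarithmically with $N_s$, whereas the second grows faster than any power of $\ln N_s$, which is why the gap between the two widens as the number of seeds increases.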



APPENDIX A: MODEL SELECTION 

1. Candidate Models for the recommendation 
distribution 

The recommendation distribution is the probability 
distribution of the number of recommendations r made 
by each participant in the campaign. As shown in figure 
lb, there is a large degree of heterogeneity in the way 
the participants engaged in the campaign. The num- 
ber of recommendations per participant varies from one 
to more than one hundred and thus any modeling of the 
distribution of recommendations has to incorporate those 
extreme events. 

We consider two distinct treatments of the number of 
recommendations : 

1. In order to incorporate the demographic stochasticity
inherent to the transmission process, many classical
epidemiological models assume that the offspring
distribution is represented by a Poisson process,
and thus $r \sim \mathrm{Poisson}(\langle r \rangle)$.

2. However, there is increasing evidence that humans
tend to respond in an untamed way in different
activities. Most people behave close to the average
behavior, but a non-negligible portion of humans
shows bursts of activity, as seen in the number of
e-mails sent per day [22], the number of telephone
calls placed by users, the number of weblog
posts by a single user [10], the time spent between
receiving and replying to an e-mail or the number
of web page clicks per user [12]. To account
for those extreme events, power-law distributions
of activity have been proposed and observed
statistically. Here we propose a model for the number
of recommendations based on a power-law distribution
$r \sim \mathrm{PL}(\alpha, \beta)$ which has the following pdf

$$ P(r) = \frac{H_{\alpha\beta}}{\beta + r^{\alpha}} \tag{A1} $$

which asymptotically decreases like a power law
and shows a cutoff at small numbers of recommendations
$r^* \simeq \beta^{1/\alpha}$. Here, $H_{\alpha\beta}$ is a
normalization constant so that $\sum_r P(r) = 1$.
2. Parameter estimation

We estimate the model parameters by the method of
moments to ensure that all models have the same mean
value $\langle r \rangle$ (and $R_0$) observed in the campaigns, so
that the difference between models is due to the different way they
handle heterogeneity. Note that the Poisson distribution
has only one parameter and thus only $\langle r \rangle$ can be fitted.
In the other case, $\mathrm{PL}(\alpha, \beta)$, there are two
parameters and the data can be fitted to the first and second moments
of $r$, as shown in Table II. We model independently the pdf
Group   $\langle r \rangle$   $\langle r^2 \rangle$   $\alpha$   $\beta$
Seeds   2.51                  15.2                    3.48       29.66
Viral   2.96                  20.5                    3.50       60.07

TABLE II: Parameters of the different probability distribution
models for the observed number of recommendations
made by seed nodes and viral nodes. Parameters $\alpha$ and $\beta$
refer to



of the number of recommendations made by seeds and 
viral nodes to account for the different r values observed. 
It is interesting to note that both pdfs seem to decay as
a power law with the same exponent $\alpha \simeq 3.5$.



APPENDIX B: VIRAL MARKETING 
PROPAGATION DYNAMICS 



2. Model for Viral Marketing propagation 

Applying the Galton- Watson formalism to the viral 
propagation dynamics, we consider a single propagation 
tree starting from one node (Go = 1) whose components 
are all nodes touched by the message. Its total size at 
generation n is F n = J~lj—n Gi arL d the nodes can be 
divided in Active (F^) and Passive (F£ = F n - F 7 f) 
depending on whether they have passed the viral mes- 
sage along or not. Now, we define the Viral Transmis- 
sibility, or the probability of any one node being Ac- 
tive, as A = F^ I F n and the Fanout Coefficient, or av- 
erage number of email referrals sent by Active nodes, as 

r v = [Xm=i r n]/Fn where r n is the number of email re- 
ferrals sent by node n. Now the average number of email 
referrals sent by all nodes (Active or Passive) is 



1. The Galton- Watson branching process 

Branching processes describe the evolution of systems 
where an initial set of objects called the 0-th genera- 
tion reproduce themselves into a set of children of the 
same kind call the first generation and so on through 
successive generations. The Galton- Watson process is 
the simplest mathematical description of such situation 
and only keeps track of the sizes of the successive genera- 
tions, not the times at which individual objects are born 
or their individual family relationships. We can define 
two sets of random variables {G„} — {Go, G%, G2, ■—} 
with G n being the number of individuals in generation n 
and {F n } = {F ,F 1 ,F 2 ,...} with F n = £" =0 G*. Sincc 
the probability law governing each generation does not 
depend on the sizes of the preceding generation, both 
form a Markov Chain. 

The probability distribution of the variable G\ is given 
by P{G\ — k) — pk and we can define its probability 
generating function (pgf) f(s) as 



f(s) = E^ sr 



(Bl) 



n=0 



whose derivative evaluated at s = 1 is the expected value 
of Gi as follows 



(Gi 



/'(I) 



E 

n=0 



(B2) 



It was demonstrated by Watson [32] that the generating function of $G_n$ is $f_n(s)$, the $n$-th iterate of the generating function $f(s)$, as follows

$$ f_n(s) = f\{f[\ldots f(s) \ldots]\} \tag{B3} $$

This important property leads to the following result for the average size of the $n$-th generation:

$$ \langle G_n \rangle = f_n'(1) = (f'(1))^n = \langle G_1 \rangle^n \tag{B4} $$



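In terms of the Transmissibility $\lambda$ and the Fanout Coefficient $\bar{r}_v$ of the propagation tree, the average number of email referrals sent by all nodes (Active or Passive) is

$$ \frac{1}{F_n} \sum_{m=1}^{F_n} r_m = \frac{F_n^A}{F_n}\, \frac{1}{F_n^A} \sum_{m=1}^{F_n} r_m = \lambda\, \bar{r}_v \tag{B5} $$

since the summation over Inactive (Passive) nodes is zero. In our mean-field approach, this value will be considered to be constant throughout all generations.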
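As a small numerical check of (B5), the sketch below computes the Transmissibility, the Fanout Coefficient and the all-node average from a list of per-node referral counts. The function name and the referral counts are hypothetical illustration values, not campaign data.

```python
def transmissibility_and_fanout(referrals):
    """Return (lambda, r_v, mean over all nodes) for a list of referral counts.

    referrals[m] is the number of referrals sent by node m; Passive nodes
    are the ones with zero referrals (assumption of this sketch).
    """
    total_nodes = len(referrals)
    active = [r for r in referrals if r > 0]
    if total_nodes == 0 or not active:
        raise ValueError("need at least one node and one Active node")
    lam = len(active) / total_nodes          # lambda = F^A / F
    fanout = sum(active) / len(active)       # r_v, averaged over Active nodes
    mean_all = sum(referrals) / total_nodes  # averaged over all nodes
    return lam, fanout, mean_all


# Hypothetical referral counts for ten nodes (three of them Active).
referrals = [0, 0, 3, 0, 1, 0, 0, 2, 0, 0]
lam, fanout, mean_all = transmissibility_and_fanout(referrals)
print(lam, fanout, mean_all)
```

For these counts the all-node average coincides with the product of Transmissibility and Fanout, as (B5) states.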

Now, the probability function of the Galton-Watson process is given by $p_0 = 1 - \lambda$ and $p_r$ for $r \in \{1, 2, \ldots\}$, where $p_r$ is the power-law distribution in (A1), with $\sum_{r=0}^{\infty} p_r = 1$, $\sum_{r=1}^{\infty} p_r = \lambda$ and $\sum_{r=0}^{\infty} r\, p_r = \lambda \bar{r}_v$. The corresponding generating function is

$$ f(s) = 1 - \lambda + \sum_{r=1}^{\infty} p_r s^r \tag{B6} $$



and applying the Galton-Watson process results in (B2) and (B4) we write the average size of each of the generations in the propagation tree as

$$ \langle G_1 \rangle = R_0 = f'(1) = \sum_{r=0}^{\infty} r\, p_r = \lambda \bar{r}_v \tag{B7} $$

and

$$ \langle G_n \rangle = f_n'(1) = [f'(1)]^n = R_0^n = (\lambda \bar{r}_v)^n \tag{B8} $$



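hence, the average size of a branch in the mean-field approach at the infinite time limit is given by

$$ F_\infty = \left\langle \sum_{n=0}^{\infty} G_n \right\rangle = \sum_{n=0}^{\infty} \langle G_n \rangle = \sum_{n=0}^{\infty} (\lambda \bar{r}_v)^n = \frac{1}{1 - \lambda \bar{r}_v} \tag{B9} $$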



since the summation converges because the system is below the percolation threshold and $\lambda \bar{r}_v < 1$. Now, the total number $N$ of nodes in the Viral Network graph in the infinite time limit results from adding the nodes in the $\bar{r}_s$ trees generated by each seed node and multiplying by the total number $N_s$ of seed nodes. Thus we have, seed nodes included, that

$$ N = N_s + N_s \bar{r}_s F_\infty = N_s \left( 1 + \frac{\bar{r}_s}{1 - \lambda \bar{r}_v} \right) \tag{B10} $$



where the validity condition of being far from the per- 
colation threshold is necessary to ensure that outbreaks 
(or clusters) originating from different seed nodes do not 
merge with one another. 
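The mean-field results (B9) and (B10) can be checked with a short Monte Carlo sketch. The offspring law used here (Active with probability $\lambda$, then a uniformly chosen number of referrals) is an assumption made only for illustration; it stands in for the power-law distribution of (A1), which is not reproduced here. All parameter values are hypothetical.

```python
import random


def simulate_tree_size(lam, referral_choices, rng):
    """Total number of nodes of one Galton-Watson tree started from one node."""
    size = 0
    current = 1  # G_0 = 1
    while current > 0:
        size += current
        children = 0
        for _ in range(current):
            if rng.random() < lam:  # the node becomes Active
                children += rng.choice(referral_choices)
        current = children
    return size


def expected_tree_size(lam, mean_fanout):
    """Mean-field branch size F_inf = 1 / (1 - lam * r_v), as in (B9)."""
    r0 = lam * mean_fanout
    if r0 >= 1.0:
        raise ValueError("the geometric series diverges for R_0 >= 1")
    return 1.0 / (1.0 - r0)


def total_network_size(n_seeds, seed_fanout, lam, mean_fanout):
    """Total number of nodes N = N_s (1 + r_s / (1 - lam * r_v)), as in (B10)."""
    return n_seeds * (1.0 + seed_fanout * expected_tree_size(lam, mean_fanout))


rng = random.Random(12345)          # fixed seed for reproducibility
lam = 0.3                           # hypothetical Transmissibility
referral_choices = [1, 2, 3]        # hypothetical referral counts, mean 2
mean_fanout = sum(referral_choices) / len(referral_choices)

sizes = [simulate_tree_size(lam, referral_choices, rng) for _ in range(20000)]
simulated_mean = sum(sizes) / len(sizes)
predicted_mean = expected_tree_size(lam, mean_fanout)
print(simulated_mean, predicted_mean)
print(total_network_size(100, 2.0, lam, mean_fanout))
```

With $R_0 = 0.6$ the process is subcritical, so every simulated tree terminates and the sample mean should lie close to the mean-field value.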



3. Age-dependent dynamics: Bellman-Harris 
process 

The description of viral marketing dynamics based on the Galton-Watson process does not consider the "waiting time" ($\tau$) elapsed between the reception of a message and the moment it is passed along, implicitly assuming that both actions take place at the same instant. However, viral propagation does not occur instantaneously, and our experiments show that the waiting time follows a log-normal distribution, much like those observed in other human activities.

To describe this behavior we will use the Bellman-Harris process, a continuous-time generalization of the Galton-Watson one, in which both the number of descendants at each generation and their lifetimes are represented by non-negative, independent random variables [32]. It is described as follows: a single ancestor originates at $t = 0$ and lives for a time $\tau$, which is a random variable with cumulative distribution function $G(\tau)$ and mean $\bar{\tau}$. At the moment of its disappearance the particle generates a number $r$ of progeny according to a probability distribution $P(r)$ whose pgf is denoted as $f(s)$. The process continues with descendants behaving independently and in the same fashion as their ancestors did. Thus, the branching process is described by the random variable $Z(t)$ representing the number of active particles at time $t$. In our case, $Z(t)$ represents the number of active participants at time $t$, i.e. the number of people that have received the information before time $t$ and that will send it at a future time.

Analytically, we use the generating function $F(s, t)$ for calculating the probability of having $Z(t)$ particles active at time $t$. It is defined as

$$ F(s, t) = \sum_{i=0}^{\infty} P(Z(t) = i)\, s^i \tag{B11} $$



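It can be proved [32] that $F(s, t)$ in the asymptotic limit satisfies a renewal equation of the form

$$ F(s, t) = s[1 - G(t)] + \int_0^{\infty} dG(\tau)\, f[F(s, t - \tau)] \tag{B12} $$

As a result $i(t)$, the expected value of $Z(t)$, verifies that

$$ i(t) = \left. \frac{\partial F(s, t)}{\partial s} \right|_{s=1} = 1 - G(t) + R_0 \int_0^{\infty} dG(\tau)\, i(t - \tau) \tag{B13} $$

where we have used that

$$ \left. \frac{\partial f[F(s, t - \tau)]}{\partial s} \right|_{s=1} = \left. \frac{df(s)}{ds} \right|_{s=1} \left. \frac{\partial F(s, t - \tau)}{\partial s} \right|_{s=1} = R_0\, i(t - \tau) \tag{B14} $$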

General explicit solutions of the integral equation (B13) do not exist, although the asymptotic behavior is known in the case in which the Malthusian parameter $\alpha$ of the population exists. This parameter is defined by

$$ R_0 \int_0^{\infty} e^{-\alpha t}\, dG(t) = 1. \tag{B15} $$

If a solution of this equation exists, then [32]

$$ i(t) \sim C e^{\alpha t}, \qquad C = \frac{R_0 - 1}{\alpha R_0^2 \int_0^{\infty} t\, e^{-\alpha t}\, dG(t)} \tag{B16} $$



The normalization of $G(t)$ implies that, if it exists, $\alpha > 0$ for $R_0 > 1$ and $\alpha < 0$ for $R_0 < 1$, thus recovering the exponential growth or decay above and below the "tipping-point". Important instances of this case are:

1. Galton-Watson process. For $G(t) = \chi(t - \tau)$, where $\chi(t)$ is the unit step function at 0 (i.e., the lifespan of all particles is identical and equal to $\tau$), we recover a Galton-Watson process with progeny generating function $f(s)$ and mean

$$ i(t = n\tau) = R_0^{t/\tau} \tag{B17} $$

which yields equation (B10) since $R_0 = \lambda \bar{r}_v$.

2. Markov age-dependent branching process. Traditional modeling of the lifespan or "waiting time" of human activities implies that $G(t)$ is of the Poissonian type $G(t) = 1 - e^{-t/\bar{\tau}}$. One of the important reasons is that this exponential distribution has the lack-of-memory property, which makes it suitable for modeling the dynamics using Markovian processes. This is exemplified in our case by the fact that, if $G(t)$ is exponentially distributed, then the solution of (B13) is exactly given by

$$ i(t) = e^{\alpha t}, \qquad \alpha = \frac{R_0 - 1}{\bar{\tau}} \tag{B18} $$



Note that both cases correspond to the basic Markovian 
growth models of epidemic transmission in which the av- 
erage number of infected people grows or decays expo- 
nentially within a time scale proportional to the average 
lifespan of infected individuals. 
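The exponential case can be verified numerically. The sketch below discretizes the renewal equation (B13) on a uniform time grid (a simple product-integration scheme chosen for this illustration, not a method taken from the text) and compares the result with the closed form (B18). It assumes $i(t) = 0$ for $t < 0$, so the integral effectively runs up to $t$; the values of $R_0$ and $\bar{\tau}$ are hypothetical.

```python
import math


def solve_renewal(cdf, r0, t_max, dt):
    """Approximate i(t) = 1 - G(t) + R_0 * int dG(tau) i(t - tau) on a grid.

    Each interval [t_{j-1}, t_j] contributes its exact probability mass
    times the average of i at the two shifted endpoints.
    """
    n = int(round(t_max / dt))
    g = [cdf(k * dt) for k in range(n + 1)]
    dg = [0.0] + [g[j] - g[j - 1] for j in range(1, n + 1)]
    i_vals = [1.0 - g[0]]
    for k in range(1, n + 1):
        # The j = 1 term contains the unknown i_k, which is solved for below.
        acc = 1.0 - g[k] + r0 * dg[1] * i_vals[k - 1] / 2.0
        for j in range(2, k + 1):
            acc += r0 * dg[j] * (i_vals[k - j] + i_vals[k - j + 1]) / 2.0
        i_vals.append(acc / (1.0 - r0 * dg[1] / 2.0))
    return i_vals


r0 = 0.5          # hypothetical value below the tipping point
mean_tau = 1.5    # mean waiting time in days
t_max, dt = 6.0, 0.01


def exponential_cdf(t):
    return 1.0 - math.exp(-t / mean_tau)


numeric = solve_renewal(exponential_cdf, r0, t_max, dt)
alpha = (r0 - 1.0) / mean_tau
exact_end = math.exp(alpha * t_max)
print(numeric[-1], exact_end)
```

The same solver accepts any other cumulative distribution function, which makes it easy to contrast the exponential case with heavier-tailed waiting times.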

However, the Malthusian parameter of the population does not exist when $R_0 < 1$ for a broad and important class of distributions called subexponential distributions: a probability distribution with cdf $G(t)$ defined on $[0, \infty)$ is said to be subexponential if $\overline{G^{*2}}(t) \sim 2\, \overline{G}(t)$ as $t \to \infty$, where $\overline{G}(t) = 1 - G(t)$ and $G^{*n}$ denotes the $n$-fold convolution of the function $G(t)$ with itself. As a consequence of this asymptotic behavior, the integral in (B15) does not exist for $\alpha < 0$, which means that the pdf of this class of distributions decays more slowly than any exponential when $t \to \infty$. Important instances like the Pareto, log-normal and Weibull distributions belong to this category. In this case, the solution of (B13) is non-Markovian and the usual modeling of epidemics in terms of growth equations or differential equations fails: in particular, the knowledge of how information has been diffused until time $t$ does not determine the dynamics for longer times. The general asymptotic behavior of equation (B13) is known to be of the form [31]

$$ i(t) \sim \frac{1}{1 - R_0}\, \overline{G}(t), \tag{B19} $$

and thus the number of infected people decays like the tail of the distribution.

We have analyzed the evolution of viral campaigns and found that the average cascade size as a function of time, $s(t) = \int_0^t i(\tau)\, d\tau$, can be modeled with remarkable precision by a Bellman-Harris process as in (B19) with $G(t)$ log-normal. Thus, instead of observing the usual exponential decay of active people $i(t) \sim e^{\alpha t}$, the active viral population evolves as

$$ i(t) \simeq \frac{1}{1 - R_0}\, \frac{1}{2} \left[ 1 - \mathrm{erf}\!\left( \frac{\ln t - \mu}{\sqrt{2}\, \sigma} \right) \right] \tag{B20} $$

$$ \simeq \frac{\sigma}{(1 - R_0) \sqrt{2\pi}\, (\ln t - \mu)}\, e^{-(\ln t - \mu)^2 / (2\sigma^2)} \tag{B21} $$

for large $t$, where $\mu$ and $\sigma$ denote the parameters of the log-normal distribution $G(t)$. The asymptotic behavior then depends on a different time scale (logarithmic in time, $\ln t$) rather than on the normal time scale $t$, a result that highlights the failure of typical modeling to explain observed behavior when the variability of humans is so large that it is described by a subexponential distribution.

Note that the influence of the log-normal distributions 
of waiting times occurs even at the population average 
level and not only on fluctuations around the average 
value i(t), i.e., it changes the dynamics not just quanti- 
tatively but also qualitatively. Finally, the dynamics is 
slowed down by the high probability of finding an indi- 
vidual with large response times, as the logarithmic time 
scale in our case shows. 
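To make the slow decay concrete, the following sketch evaluates the exact tail of a log-normal waiting-time distribution, its leading large-$t$ approximation, and the resulting $i(t)$ from (B19). The parameters `mu`, `sigma` and `r0` are hypothetical values chosen for illustration, not the fitted campaign values.

```python
import math


def lognormal_survival(t, mu, sigma):
    """Exact tail 1 - G(t) of a log-normal distribution."""
    return 0.5 * math.erfc((math.log(t) - mu) / (sigma * math.sqrt(2.0)))


def lognormal_tail_asymptotic(t, mu, sigma):
    """Leading large-t approximation of the log-normal tail."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (z * math.sqrt(2.0 * math.pi))


def active_population(t, r0, mu, sigma):
    """Asymptotic i(t) = (1 - G(t)) / (1 - R_0) for R_0 < 1, as in (B19)."""
    return lognormal_survival(t, mu, sigma) / (1.0 - r0)


mu, sigma, r0 = 0.0, 1.0, 0.5                # hypothetical parameters
t = math.exp(5.0)                            # five standard deviations in ln t
exact_tail = lognormal_survival(t, mu, sigma)
approx_tail = lognormal_tail_asymptotic(t, mu, sigma)
ratio = approx_tail / exact_tail

# Exponential waiting times with the same mean, for comparison.
mean_tau = math.exp(mu + 0.5 * sigma ** 2)
exponential_tail = math.exp(-t / mean_tau)
print(ratio, exact_tail, exponential_tail, active_population(t, r0, mu, sigma))
```

At this value of $t$ the log-normal tail is many orders of magnitude heavier than the exponential tail with the same mean, which is the origin of the slowing down discussed above.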

For $R_0 > 1$ the Malthusian parameter exists for the class of subexponential distributions and then $i(t)$ grows exponentially like $i(t) \sim e^{\alpha t}$. But, even in this case, there is a large quantitative difference between the solutions of equation (B15) and the values expected by assuming exponential distributions. As shown in figure 6, the difference in our case can be of one order of magnitude, which implies that if the campaign reaches the tipping-point the information spreads much faster than expected. For example, if $R_0 = 2$ and using the value of $\bar{\tau} \simeq 1.5$ days obtained in our campaigns, we should have expected an exponential growth with time scale $\alpha^{-1} = \bar{\tau} \simeq 1.5$ days, while in the case of a log-normal distribution we get $\alpha^{-1} \simeq 7$ hours. This large quantitative difference is due to the fact that subexponential distributions are more skewed than the Poisson ones and thus there is a higher probability of finding participants with small "waiting-times" (compared to the mean) in subexponential distributions. Those fast responders are responsible for this exponential growth with shorter time scale.

FIG. 6: Malthusian parameter of the population above the "tipping-point" as a function of the average number of secondary cases for different distributions of $G(t)$.
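The comparison above can be reproduced qualitatively by solving (B15) numerically. The sketch below finds the Malthusian parameter by bisection for an exponential and for a log-normal waiting-time distribution with the same mean. The log-normal width `sigma` is a hypothetical value (the fitted campaign value is not given in this section), so the resulting time scale is illustrative only.

```python
import math


def laplace_exponential(alpha, mean_tau):
    """Closed-form E[exp(-alpha T)] for exponential T with the given mean."""
    return 1.0 / (1.0 + alpha * mean_tau)


def laplace_lognormal(alpha, mu, sigma, steps=4000, z_max=8.0):
    """E[exp(-alpha T)] for log-normal T, trapezoid rule in z with t = exp(mu + sigma z)."""
    h = 2.0 * z_max / steps
    total = 0.0
    for k in range(steps + 1):
        z = -z_max + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        density = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * density * math.exp(-alpha * math.exp(mu + sigma * z))
    return total * h


def malthusian_parameter(laplace, r0, tol=1e-10):
    """Solve r0 * laplace(alpha) = 1 for alpha > 0 by bisection (needs r0 > 1)."""
    if r0 <= 1.0:
        raise ValueError("this sketch only handles R_0 > 1")
    lo, hi = 0.0, 1.0
    while r0 * laplace(hi) > 1.0:   # expand the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if r0 * laplace(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r0 = 2.0
mean_tau = 1.5                                  # days
sigma = 1.5                                     # hypothetical log-normal width
mu = math.log(mean_tau) - 0.5 * sigma ** 2      # keeps the mean at 1.5 days

alpha_exp = malthusian_parameter(lambda a: laplace_exponential(a, mean_tau), r0)
alpha_logn = malthusian_parameter(lambda a: laplace_lognormal(a, mu, sigma), r0)
print(1.0 / alpha_exp, 1.0 / alpha_logn)        # growth time scales in days
```

For the exponential case the solver recovers $\alpha^{-1} = \bar{\tau}$, while the log-normal case with the same mean gives a shorter growth time scale.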



APPENDIX C: INFERENCES ON THE 
SUBSTRATE E-MAIL NETWORK 

The e-mail Network serving as substrate of the viral 
messages propagation is formed by individuals (nodes) 
and by their e-mail connections (links between nodes) 
as determined by the addresses listed in their e-mail ad- 
dress books. In their propagation, viral messages can 
only go through the links in the e-mail Network and the 
viral network is thus a subset of it. We have observed 
however, that even when viral propagation has fully per- 
colated, the substrate e-mail Network is not readily per- 
ceived through observation of the Viral Network. 

Nevertheless, because both networks are related, some parameters of the e-mail Network can be gleaned through measures on the Viral Network. We prove here that in a viral propagation process the clustering coefficients of the substrate network (the e-mail Network) and of its virally percolated subset (the Viral Network) are correlated and derive, based on a mean-field approximation, an expression for such correlation. The clustering coefficient, according to Watts and Strogatz [30], is defined as

$$ C = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{number of triangles connected to node } i}{\text{number of triples centered on node } i} \tag{C1} $$

where "triple" means a single node with edges running to an unordered pair of others. If such a pair is also connected, it forms a triangle or "transitive triad". Now we can write, in a mean-field approximation, the clustering coefficients of the e-mail and Viral networks respectively as



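$$ C_{\mathrm{email}} = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{(\mathrm{triang}_{\mathrm{email}})_i}{(\mathrm{triples}_{\mathrm{email}})_i} \approx \frac{\langle (\mathrm{triang}_{\mathrm{email}})_i \rangle}{\langle (\mathrm{triples}_{\mathrm{email}})_i \rangle} \tag{C2} $$

$$ C_{\mathrm{viral}} = \frac{1}{N_v} \sum_{i=1}^{N_v} \frac{(\mathrm{triang}_{\mathrm{viral}})_i}{(\mathrm{triples}_{\mathrm{viral}})_i} \approx \frac{\langle (\mathrm{triang}_{\mathrm{viral}})_i \rangle}{\langle (\mathrm{triples}_{\mathrm{viral}})_i \rangle} \tag{C3} $$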
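The difference between the node-averaged definition and its mean-field approximation can be seen on a toy graph. The sketch below computes both for an undirected graph stored as an adjacency dictionary; the graph is a hypothetical example, and counting nodes with fewer than two neighbours as contributing zero is a convention assumed here.

```python
from itertools import combinations


def clustering_coefficients(adjacency):
    """Return (node-averaged C, mean-field ratio <triangles>/<triples>)."""
    local_sum = 0.0
    triangle_sum = 0
    triple_sum = 0
    for node, neighbours in adjacency.items():
        k = len(neighbours)
        triples = k * (k - 1) // 2   # unordered neighbour pairs centred on node
        triangles = sum(
            1 for a, b in combinations(neighbours, 2) if b in adjacency[a]
        )
        if triples > 0:
            local_sum += triangles / triples
        triangle_sum += triangles
        triple_sum += triples
    node_average = local_sum / len(adjacency)
    mean_field = triangle_sum / triple_sum if triple_sum else 0.0
    return node_average, mean_field


# Hypothetical graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adjacency = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}
node_average, mean_field = clustering_coefficients(adjacency)
print(node_average, mean_field)
```

On this small graph the two values are close but not equal, which is why the relations above are written as approximations.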



Considering an e-mail Network node connected to triangles and triples, we can follow the bond percolation progress of a viral message planted on it. The probability of a triangle on such a node being fully percolated by e-mails is the joint probability of percolation of each of the edges in the triple and of the link between the two neighbors at the end of them, which forms the third side of the triangle:

$$ P(\mathrm{perc\_triang}) = P(\mathrm{perc\_triple}) \times P(\mathrm{perc\_3rd\_side}) \tag{C4} $$

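As a result, we can estimate as follows the average number of triangles and triples in the Viral Network with the mean-field approximation

$$ \langle (\mathrm{triang}_{\mathrm{viral}})_i \rangle = P(\mathrm{perc\_triple}) \times P(\mathrm{perc\_3rd\_side}) \times \langle (\mathrm{triang}_{\mathrm{email}})_i \rangle \tag{C5} $$

$$ \langle (\mathrm{triples}_{\mathrm{viral}})_i \rangle = P(\mathrm{perc\_triple}) \times \langle (\mathrm{triples}_{\mathrm{email}})_i \rangle \tag{C6} $$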

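Combining (C2), (C3), (C5) and (C6) we obtain

$$ C_{\mathrm{viral}} \approx P(\mathrm{perc\_3rd\_side}) \times \frac{\langle (\mathrm{triang}_{\mathrm{email}})_i \rangle}{\langle (\mathrm{triples}_{\mathrm{email}})_i \rangle} \tag{C7} $$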



Considering that the clustering coefficient is calculated for non-directed networks (i.e. arcs in the e-mail Network are assimilated to undirected edges), that nodes reached by the viral message become active with probability $\lambda$ (the Transmissibility) and that, after becoming active, they send messages with Fanout $\bar{r}_v$ each, we conclude that the probability of the third side of the triple being percolated by a viral message, so as to close a triangle, is given by

$$ P(\mathrm{perc\_3rd\_side}) = \frac{2 \lambda \bar{r}_v}{\langle k_{nn} \rangle_e - 1} = \frac{2 R_0}{\langle k_{nn} \rangle_e - 1} \tag{C8} $$



where $\langle k_{nn} \rangle_e$ is the average over the email network of the nearest neighbors' average degree. It has to be decreased by 1 because the propagation rules do not allow messages to be sent back to ancestor nodes. The factor 2 results from the fact that either of the two nodes at the open end of a triple can send the message that closes the corresponding triangle. Substituting (C8) and (C2) in (C7) we arrive at the relationship between the clustering coefficient of an e-mail Network and that of its virally percolated subset

$$ C_{\mathrm{viral}} \approx \frac{2 R_0}{\langle k_{nn} \rangle_e - 1} \times C_{\mathrm{email}} \tag{C9} $$

FIG. 7: Clustering coefficient $C_{\mathrm{viral}}$ for the viral cascades obtained through simulations of the viral propagation model on a real email network (symbols) compared with the linear relationship given by equation (C9). The email network has $C_{\mathrm{email}} = 0.2202$ and $\langle k_{nn} \rangle_e = 18.903$.



This expression has been tested through simulations of the viral propagation model on a real email network gathered from the email server logs of a Spanish university [29] (see figure 7). In the model, any node becomes a secondary spreader with probability $\lambda$ and transmits the message to $r$ of his/her email connections (if possible), with $\bar{r}_v$ recommendations on average. While the real network has a rather large clustering coefficient, $C_{\mathrm{email}} \simeq 0.22$, the resulting viral cascades have a very small clustering coefficient even for large probabilities $\lambda$ of getting infected. These low values of $C_{\mathrm{viral}}$ justify the assumption made in our model that the social network is largely irrelevant to understanding the dynamics of information propagation below or even close to the tipping point.
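A short worked example shows how small the predicted viral clustering is. The sketch below evaluates the linear relationship (C9) using the email-network values quoted in the caption of figure 7, reading the second quoted value as the nearest-neighbour average degree (an assumption of this sketch). The values of $R_0$ are hypothetical.

```python
def predicted_viral_clustering(r0, c_email, k_nn):
    """Mean-field estimate C_viral = 2 R_0 / (<k_nn> - 1) * C_email, as in (C9)."""
    if k_nn <= 1.0:
        raise ValueError("<k_nn> must exceed 1")
    return 2.0 * r0 / (k_nn - 1.0) * c_email


C_EMAIL = 0.2202   # clustering coefficient quoted for the email network
K_NN = 18.903      # assumed to be the nearest-neighbour average degree

for r0 in (0.25, 0.5, 1.0):    # hypothetical reproductive numbers
    print(r0, predicted_viral_clustering(r0, C_EMAIL, K_NN))
```

Even at the tipping point ($R_0 = 1$) the predicted viral clustering is roughly one tenth of that of the substrate network.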



APPENDIX D: VIRAL CAMPAIGNS GENERAL 
DESCRIPTION 

The following describes in some detail the technical and marketing aspects involved in the execution of the Viral Marketing campaigns used as the source of the viral propagation data in our studies. It covers 16 different campaigns executed in 11 European countries, all of them with the same structure, strategy, user interfaces, data flow and participant conditions.

The primary marketing objective of the viral campaign was to increase the number of subscriptions to the company's on-line newsletter. The offering consisted of a free subscription to this newsletter, which could be customized according to the subscriber's interests: the subscriber was asked to choose from a list of available generic topics represented by interest codes. The subscription was formalized by filling in a form located on the main campaign web page (a.k.a. registration page). A series of drive-to-web tactics, variable by country, was put in place to attract visitors to the registration page. These included e-mail campaigns, banner advertising, search engine placement, promotion on the company web site and other web-based promotional activities.

Additionally, a viral propagation tool consisting of a 
button located at the registration page was established 
to trigger the message propagation. The caption in that 
button invited visitors to recommend the page to friends 
and colleagues and offered, as additional incentive for 
people to forward the page, tickets for a prize draw to win 
a laptop computer. Two situations caused participants 
to become eligible to receive prize draw tickets: 



• One ticket was assigned to participants sending 
any number of recommendations to friends or col- 
leagues 

• Unlimited number of additional tickets were given 
to the sender for each of the recommended friends 
who would, as a result of such recommendation, 
subscribe to the newsletter 



The ticket eligibility rules above were designed to discourage spam-like behavior, in which recommendations are sent indiscriminately to individuals not interested in the offering, while encouraging participants to send the highest possible number of recommendations to individuals presumed to be interested in the newsletter. Additionally, the participation rules guarantee that the incentive was a direct consequence of the viral message propagation and not of registration to the newsletter.
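The two eligibility rules can be summarized in a few lines. The function below is a hypothetical restatement written for illustration; it is not the campaign's actual implementation, and its name and argument names are ours.

```python
def prize_draw_tickets(recommendations_sent, recommended_friends_subscribed):
    """Tickets earned by one participant under the two eligibility rules.

    One ticket for sending any number of recommendations, plus one ticket
    per recommended friend who subscribed as a result.
    """
    if recommendations_sent < 0 or recommended_friends_subscribed < 0:
        raise ValueError("counts must be non-negative")
    if recommended_friends_subscribed > recommendations_sent:
        raise ValueError("cannot convert more friends than were recommended")
    base_ticket = 1 if recommendations_sent > 0 else 0
    return base_ticket + recommended_friends_subscribed


print(prize_draw_tickets(0, 0))   # registered but recommended nobody
print(prize_draw_tickets(5, 0))   # five recommendations, none subscribed
print(prize_draw_tickets(5, 3))   # five recommendations, three subscribed
```

Sending many recommendations that convert nobody earns no more than a single ticket, which is how the rules discourage indiscriminate forwarding.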